long <- read.csv("https://www.math.ntnu.no/emner/TMA4315/2021h/eliteserie.csv",
colClasses = c("factor","factor","factor","numeric"))
library(glmmTMB)
mod <- glmmTMB(goals ~ home + (1|attack) + (1|defence), poisson, data=long, REML=TRUE)
Given the random effects and the covariate the responses are conditionally independent and the conditional distribution is Poisson.
The conditional mean is linked to the linear predictor through the link function
The random effects , , are independent and identically distributed
The model gives the expected number of goals that the “attack” team will score if it plays against “defence” team The dummy variable represent the factor home having 2 levels () and it is equal to 1 for meaning that the “attack” team has home field advantage or 0 if meaning that the “defence” team has home field advantage.
summary(mod)
## Family: poisson ( log )
## Formula: goals ~ home + (1 | attack) + (1 | defence)
## Data: long
##
## AIC BIC logLik deviance df.resid
## 1147.2 1163.1 -569.6 1139.2 382
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## attack (Intercept) 0.007478 0.08647
## defence (Intercept) 0.016383 0.12800
## Number of obs: 384, groups: attack, 16; defence, 16
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.12421 0.07809 1.591 0.112
## homeyes 0.40716 0.08745 4.656 3.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimate of home advantage is 0.41 which means that the home team scores on average more goals, that is 50% more goals, regardless of team strengths which is reasonable for a footbal match. We observe that the variance of defence strengths () vary more than the attack strengths ().
To make interpretation easier we can define the strength which is the difference between attack and diffence
rf = ranef(mod)
m = data.frame(attack = unlist(rf$cond$attack), defence = unlist(rf$cond$defence))
rownames(m) = rownames(rf$cond$attack)
m$strength = m$attack-m$defence
m
## attack defence strength
## BodoeGlimt -0.036781062 -0.042616090 0.005835029
## Brann 0.012026209 -0.123934761 0.135960970
## Haugesund 0.011223106 -0.061931278 0.073154385
## Kristiansund -0.011367328 0.008112432 -0.019479760
## Lillestroem -0.049915996 0.030699257 -0.080615253
## Molde 0.078390643 -0.036630979 0.115021622
## Odd 0.003654179 -0.052013600 0.055667779
## Ranheim_TF 0.023375599 0.062209734 -0.038834135
## Rosenborg 0.050622609 -0.152631173 0.203253783
## Sandefjord_Fotball -0.058333079 0.133164228 -0.191497307
## Sarpsborg08 0.026946364 0.006574064 0.020372301
## Stabaek -0.026801293 0.085376126 -0.112177420
## Start -0.060500163 0.081958112 -0.142458276
## Stroemsgodset 0.024556017 0.040486666 -0.015930650
## Tromsoe 0.005756700 -0.009852817 0.015609517
## Vaalerenga 0.007147494 0.031030079 -0.023882585
The attack and defence strengths of teams are closely related to the ranking of the teams so far as you can see in e).
A team of average attack strength has and similarly an defence attack team has Since the number of goals given the random effects follows Poisson distribution the expection and variance of number of goals are equal
betas = fixef(mod)$cond
goals_homeyes = exp(sum(betas))
goals_homeno = exp(betas[1])
c(goals_homeyes = as.numeric(goals_homeyes), goals_homeno = as.numeric(goals_homeno))
## goals_homeyes goals_homeno
## 1.701265 1.132258
First, it holds that if a random variable follows a normal distribution with mean 0 and variance then X follows a Lognormal distribution with mean and variance
Since the random variables and are independent, the sum is normally distributed with mean equal to the sum of the corresponding means and variance equal to the sum of the corresponding variances. We write that
To derive we use the low of total expectation
To derive we use the low of total variance
(tau2_attack = VarCorr(mod)$cond$attack[1])
## [1] 0.007477577
(tau2_defence= VarCorr(mod)$cond$defence[1])
## [1] 0.01638331
tau2 = tau2_attack + tau2_defence
marg_exp_homeyes = exp(sum(betas) + 0.5*(tau2) )
marg_var_homeyes = marg_exp_homeyes + exp(2*sum(betas)) * (exp(tau2)-1) * exp(tau2)
c(marg_exp_homeyes=as.vector(marg_exp_homeyes), marg_var_homeyes = as.vector(marg_var_homeyes))
## marg_exp_homeyes marg_var_homeyes
## 1.721683 1.793262
marg_exp_homeno = exp(betas[1]+ 0.5*(tau2) )
marg_var_homeno = marg_exp_homeno + exp(2*betas[1]) * (exp(tau2)-1) * exp(tau2)
c(marg_exp_homeno=as.numeric(marg_exp_homeno), marg_var_homeno = as.numeric(marg_var_homeno))
## marg_exp_homeno marg_var_homeno
## 1.145847 1.177552
For the home team we have
var_game_prop_hy = marg_exp_homeyes/marg_var_homeyes
var_strength_prop_hy = exp(2*sum(betas)) * (exp(tau2)-1) * exp(tau2)/marg_var_homeyes
c(var_game_prop_hy =as.numeric(var_game_prop_hy),var_strength_prop_hy =as.numeric(var_strength_prop_hy))
## var_game_prop_hy var_strength_prop_hy
## 0.96008456 0.03991544
For the away team we have
var_game_prop_hn = marg_exp_homeno/marg_var_homeno
var_strength_prop_hn = exp(2*betas[1]) * (exp(tau2)-1) * exp(tau2)/marg_var_homeno
c(var_game_prop_hn =as.numeric(var_game_prop_hn),var_strength_prop_hn =as.numeric(var_strength_prop_hn))
## var_game_prop_hn var_strength_prop_hn
## 0.97307527 0.02692473
The test of significance of random effects and is equivalent to testing if the variance of each random effect and is 0 or greater than 0. We construct the following hypothesis testing
no_at_mod = glmmTMB(goals ~ home + (1|defence), poisson, data=long, REML=TRUE)
LRT_attack = as.numeric(2*(logLik(mod)-logLik(no_at_mod)))
pval = 0.5 * pchisq(LRT_attack, df = 1, lower.tail = FALSE) +
0.5 * pchisq(LRT_attack, df = 0 , lower.tail = FALSE)
pval
## [1] 0.2587558
no_de_mod = glmmTMB(goals ~ home + (1|attack), poisson, data=long, REML=TRUE)
LRT_defence = as.numeric(2*(logLik(mod)-logLik(no_de_mod)))
pval = 0.5 * pchisq(LRT_defence, df = 1, lower.tail = FALSE) +
0.5 * pchisq(LRT_defence, df = 0, lower.tail = FALSE)
pval
## [1] 0.09843786
To test the significance of of the fixed effect parameter for the home field advantage we must use the ordinary instead of the restricted likelihoods as the restricted likelihoods, at least in the LMM setting, involves different subsets of the data and are thus not comparable.
mod_ml = glmmTMB(goals ~ (1|attack) + (1|defence), poisson, data=long, REML=FALSE)
no_home_mod = glmmTMB(goals ~ (1|attack) + (1|defence), poisson, data=long, REML=FALSE)
LRT_home = as.numeric(2*(logLik(mod_ml)-logLik(no_home_mod)))
pchisq(LRT_home, df = 1, lower.tail = F)
## [1] 1
rankings = function(data) {
data = data[complete.cases(data),]
tmp =
data.frame(
team = levels(data$attack),
points = 0,
goalsscored = 0,
goalsconceeded = 0,
scorediff = 0
)
# add points to each team going through each match separately
for (i in seq(1, nrow(data), by = 2)) {
j = data[i,]$attack
k = data[i,]$defence
if (data[i,]$goals > data[i + 1,]$goals)
tmp[tmp$team == j,]$points =
tmp[tmp$team == j,]$points + 3 # team j won
else if (data[i,]$goals < data[i + 1,]$goals)
tmp[tmp$team == k,]$points =
tmp[tmp$team == k,]$points + 3 # team k won
else {
tmp[tmp$team == j,]$points =
tmp[tmp$team == j,]$points + 1 # draw
tmp[tmp$team == k,]$points =
tmp[tmp$team == k,]$points + 1
}
}
# compute goal differences and goals scored
for (j in levels(data$attack)) {
tmp[tmp$team == j, ]$goalsscored =
sum(data[data$attack == j,]$goals)
tmp[tmp$team == j, ]$goalsconceeded =
sum(data[data$defence == j,]$goals)
}
tmp$scorediff = tmp$goalsscored - tmp$goalsconceeded
# compute the rankings
tmp$rank = data.table::frankv(
tmp,
col = c("points", "scorediff", "goalsscored"),
ties.method = "random",
order = -1
)
# return everything
tmp[order(tmp$points, decreasing = T),]
}
rnk = rankings(long)
rownames(rnk) = NULL
rnk
## team points goalsscored goalsconceeded scorediff rank
## 1 Rosenborg 52 43 20 23 1
## 2 Brann 48 36 23 13 2
## 3 Molde 43 48 30 18 3
## 4 Haugesund 41 36 28 8 4
## 5 Ranheim_TF 38 38 40 -2 5
## 6 Vaalerenga 36 35 37 -2 6
## 7 Odd 34 35 29 6 7
## 8 Tromsoe 33 35 33 2 8
## 9 Sarpsborg08 32 39 34 5 9
## 10 Kristiansund 31 32 35 -3 10
## 11 BodoeGlimt 27 28 30 -2 11
## 12 Stroemsgodset 26 38 38 0 12
## 13 Lillestroem 25 26 37 -11 13
## 14 Stabaek 23 29 43 -14 14
## 15 Start 23 24 42 -18 15
## 16 Sandefjord_Fotball 15 24 47 -23 16
lambdas = predict(mod, newdata = long, type = "response")
sim_rank = matrix(NA,nrow = 1000, ncol = 16)
for (i in 1:1000) {
res = long
res$goals = rpois(length(lambdas), lambda = lambdas)
r = rankings(res)
sim_rank[i,] = as.character(r$team)
}
nm = levels(long$attack) # team names
lnm = length(nm) # total number of teams
probs = matrix(0, nrow = lnm, ncol = lnm) # matrix to store the probabilities of rankings
rownames(probs) = nm
colnames(probs) = paste0(1:lnm,"-place")
for (j in 1:ncol(sim_rank)) {
prob = table(sim_rank[,j])/1000 # calculate probabilities of j-th place
nms = names(prob) # team names of corresponding probabilities
for(i in 1:length(prob)){
ind = which(nm == nms[i]) # find row of i-th name
probs[ind,j] = as.numeric(prob[i]) # store probability
}
}
> probs
1-place 2-place 3-place 4-place 5-place 6-place 7-place 8-place 9-place
BodoeGlimt 0.059 0.071 0.069 0.064 0.076 0.061 0.058 0.072 0.061
Brann 0.127 0.101 0.112 0.094 0.076 0.068 0.067 0.060 0.064
Haugesund 0.085 0.082 0.082 0.089 0.074 0.067 0.086 0.073 0.056
Kristiansund 0.052 0.054 0.076 0.067 0.052 0.050 0.069 0.079 0.059
Lillestroem 0.031 0.030 0.043 0.063 0.044 0.074 0.069 0.063 0.055
Molde 0.117 0.106 0.084 0.082 0.096 0.079 0.066 0.071 0.063
Odd 0.070 0.091 0.091 0.080 0.086 0.070 0.057 0.077 0.065
Ranheim_TF 0.039 0.048 0.057 0.048 0.057 0.070 0.063 0.046 0.094
Rosenborg 0.182 0.132 0.087 0.105 0.083 0.080 0.067 0.051 0.048
Sandefjord_Fotball 0.010 0.015 0.017 0.020 0.033 0.030 0.041 0.036 0.059
Sarpsborg08 0.054 0.058 0.072 0.060 0.075 0.073 0.068 0.072 0.078
Stabaek 0.019 0.021 0.032 0.029 0.043 0.048 0.054 0.050 0.053
Start 0.014 0.024 0.019 0.035 0.041 0.043 0.039 0.046 0.059
Stroemsgodset 0.040 0.066 0.047 0.048 0.049 0.062 0.064 0.060 0.056
Tromsoe 0.063 0.057 0.060 0.055 0.055 0.063 0.073 0.074 0.071
Vaalerenga 0.038 0.044 0.052 0.061 0.060 0.062 0.059 0.070 0.059
10-place 11-place 12-place 13-place 14-place 15-place 16-place
BodoeGlimt 0.066 0.081 0.056 0.060 0.053 0.051 0.042
Brann 0.048 0.053 0.042 0.027 0.028 0.015 0.018
Haugesund 0.052 0.056 0.053 0.043 0.051 0.030 0.021
Kristiansund 0.060 0.062 0.075 0.062 0.080 0.058 0.045
Lillestroem 0.072 0.059 0.076 0.082 0.078 0.075 0.086
Molde 0.039 0.035 0.053 0.036 0.034 0.024 0.015
Odd 0.056 0.058 0.043 0.054 0.035 0.039 0.028
Ranheim_TF 0.066 0.074 0.077 0.083 0.061 0.066 0.051
Rosenborg 0.027 0.038 0.026 0.020 0.024 0.021 0.009
Sandefjord_Fotball 0.078 0.055 0.083 0.082 0.111 0.128 0.202
Sarpsborg08 0.059 0.064 0.057 0.065 0.053 0.052 0.040
Stabaek 0.097 0.063 0.071 0.085 0.090 0.113 0.132
Start 0.073 0.082 0.081 0.095 0.078 0.120 0.151
Stroemsgodset 0.062 0.074 0.070 0.078 0.089 0.076 0.059
Tromsoe 0.082 0.072 0.067 0.057 0.059 0.043 0.049
Vaalerenga 0.063 0.074 0.070 0.071 0.076 0.089 0.052
>
expect_ranking = matrix(0, nrow = nrow(probs))
rownames(expect_ranking) = rownames(probs)
for(i in 1:nrow(probs)){
expect_ranking[i,] = sum(1:16*probs[i,])
}
t(expect_ranking)
## BodoeGlimt Brann Haugesund Kristiansund Lillestroem Molde Odd Ranheim_TF
## [1,] 8.123 6.177 7.144 8.56 9.558 6.415 7.234 8.995
## Rosenborg Sandefjord_Fotball Sarpsborg08 Stabaek Start Stroemsgodset
## [1,] 5.386 11.775 8.178 10.718 11.041 9.172
## Tromsoe Vaalerenga
## [1,] 8.384 9.14
lambdas = predict(mod, newdata = long[!complete.cases(long),], type = "response")
sim_rank = matrix(NA,nrow = 1000, ncol = 16)
for (i in 1:1000) {
res = long
res$goals[!complete.cases(res$goals)] = rpois(length(lambdas), lambda = lambdas)
r = rankings(res)
sim_rank[i,] = as.character(r$team)
}
nm = levels(long$attack) # team names
lnm = length(nm) # total number of teams
probs = matrix(0, nrow = lnm, ncol = lnm) # matrix to store the probabilities of rankings
rownames(probs) = nm
colnames(probs) = paste0(1:lnm,"-place")
for (j in 1:ncol(sim_rank)) {
prob = table(sim_rank[,j])/1000 # calculate probabilities of j-th place
nms = names(prob) # team names of corresponding probabilities
for(i in 1:length(prob)){
ind = which(nm == nms[i]) # find row of i-th name
probs[ind,j] = as.numeric(prob[i]) # store probability
}
}
> probs
1-place 2-place 3-place 4-place 5-place 6-place 7-place 8-place 9-place
BodoeGlimt 0.000 0.000 0.000 0.000 0.001 0.003 0.014 0.034 0.080
Brann 0.252 0.669 0.073 0.006 0.000 0.000 0.000 0.000 0.000
Haugesund 0.000 0.031 0.371 0.368 0.161 0.049 0.014 0.005 0.001
Kristiansund 0.000 0.000 0.000 0.001 0.025 0.091 0.117 0.173 0.210
Lillestroem 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.006 0.023
Molde 0.006 0.058 0.486 0.346 0.082 0.019 0.000 0.003 0.000
Odd 0.000 0.000 0.003 0.026 0.114 0.188 0.225 0.205 0.128
Ranheim_TF 0.000 0.004 0.037 0.186 0.345 0.202 0.125 0.066 0.024
Rosenborg 0.742 0.238 0.019 0.001 0.000 0.000 0.000 0.000 0.000
Sandefjord_Fotball 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Sarpsborg08 0.000 0.000 0.001 0.004 0.032 0.070 0.116 0.157 0.208
Stabaek 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.009
Start 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.004
Stroemsgodset 0.000 0.000 0.000 0.000 0.000 0.001 0.004 0.026 0.043
Tromsoe 0.000 0.000 0.001 0.010 0.054 0.131 0.175 0.185 0.180
Vaalerenga 0.000 0.000 0.009 0.052 0.186 0.246 0.208 0.136 0.090
10-place 11-place 12-place 13-place 14-place 15-place 16-place
BodoeGlimt 0.149 0.247 0.238 0.130 0.077 0.027 0.000
Brann 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Haugesund 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kristiansund 0.186 0.126 0.047 0.020 0.003 0.001 0.000
Lillestroem 0.050 0.100 0.169 0.271 0.233 0.142 0.004
Molde 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Odd 0.080 0.024 0.005 0.002 0.000 0.000 0.000
Ranheim_TF 0.010 0.001 0.000 0.000 0.000 0.000 0.000
Rosenborg 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Sandefjord_Fotball 0.000 0.000 0.000 0.001 0.009 0.067 0.923
Sarpsborg08 0.221 0.120 0.053 0.016 0.002 0.000 0.000
Stabaek 0.022 0.084 0.131 0.213 0.283 0.241 0.015
Start 0.009 0.037 0.063 0.154 0.237 0.437 0.057
Stroemsgodset 0.088 0.152 0.261 0.187 0.152 0.085 0.001
Tromsoe 0.139 0.085 0.030 0.006 0.004 0.000 0.000
Vaalerenga 0.046 0.024 0.003 0.000 0.000 0.000 0.000
expect_ranking = matrix(0, nrow = nrow(probs))
rownames(expect_ranking) = rownames(probs)
for(i in 1:nrow(probs)){
expect_ranking[i,] = sum(1:16*probs[i,])
}
t(expect_ranking)
## BodoeGlimt Brann Haugesund Kristiansund Lillestroem Molde Odd Ranheim_TF
## [1,] 8.123 6.177 7.144 8.56 9.558 6.415 7.234 8.995
## Rosenborg Sandefjord_Fotball Sarpsborg08 Stabaek Start Stroemsgodset
## [1,] 5.386 11.775 8.178 10.718 11.041 9.172
## Tromsoe Vaalerenga
## [1,] 8.384 9.14
library(ggplot2)
nm = levels(long$attack)
pos = matrix(NA, nrow = nrow(sim_rank), ncol = ncol(sim_rank))
for(i in 1:nrow(pos)){
pos[i,] = match(nm, sim_rank[i,])
}
pos = data.frame(pos)
colnames(pos) = nm
pos_plot = tidyr::gather(pos, key = "team", value = "pos")
ggplot(data = pos_plot, aes(pos)) + geom_bar() + facet_wrap(~team)
We observe that the uncertainty regarding the first and last ranking positions is smaller compared to the in between rankings.
exp_pos = colMeans(pos)
sd_pos = apply(pos,2,sd)/sqrt(1000)
dfplot = data.frame(team = nm, exp_pos = exp_pos, sd_pos = sd_pos, strength = m$strength)
bs = lm(exp_pos~strength, data = dfplot)$coefficients
pd <- position_dodge(0.1)
ggplot(dfplot, aes(x=strength, y=exp_pos, colour = team)) +
geom_point()+
geom_errorbar(aes(ymin=exp_pos-1.96*sd_pos, ymax=exp_pos+1.96*sd_pos)) +
geom_abline(intercept = bs[1] ,slope = bs[2], colour = "grey")
Unsurprisingly, in the plot it seems that there is a strong relation between the expected ranking and team strengths. The relationship between the estimated expected rankings based on simulation of the whole series and the difference between attack and defence strength of each team does not seem to be completely montonic but this may be because of Monte carlo error so it remains somewhat unclear if the relationship is monotonic or not.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] glmmTMB_1.0.2.1 lme4_1.1-23 Matrix_1.2-18 reshape2_1.4.4
## [5] ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 TMB_1.7.18 nloptr_1.2.2.2 pillar_1.4.6
## [5] compiler_4.0.3 plyr_1.8.6 tools_4.0.3 boot_1.3-25
## [9] digest_0.6.27 statmod_1.4.35 nlme_3.1-149 evaluate_0.14
## [13] lifecycle_0.2.0 tibble_3.0.4 gtable_0.3.0 lattice_0.20-41
## [17] pkgconfig_2.0.3 rlang_0.4.8 yaml_2.2.1 xfun_0.19
## [21] withr_2.3.0 dplyr_1.0.2 stringr_1.4.0 knitr_1.30
## [25] generics_0.1.0 vctrs_0.3.5 grid_4.0.3 tidyselect_1.1.0
## [29] data.table_1.13.2 glue_1.4.2 R6_2.5.0 rmarkdown_2.5
## [33] minqa_1.2.4 farver_2.0.3 tidyr_1.1.2 purrr_0.3.4
## [37] magrittr_1.5 MASS_7.3-53 scales_1.1.1 ellipsis_0.3.1
## [41] htmltools_0.5.0 splines_4.0.3 colorspace_2.0-0 labeling_0.4.2
## [45] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4